Abstract

Off-policy learning in dynamic decision problems is essential for providing strong evidence that a new policy is better than the one in use. But how can we prove superiority without testing the new policy? To answer this question, we introduce the G-SCOPE algorithm, which evaluates a new policy based on data generated by the existing policy. Our algorithm is both computationally and sample efficient because it greedily learns to exploit factored structure in the dynamics of the environment. We present a finite sample analysis of our approach and show through experiments that the algorithm scales well on high-dimensional problems with few samples.

1. Introduction 

Reinforcement Learning (RL) algorithms learn to maximize rewards by analyzing past experience with an unknown environment. Most RL algorithms assume that they can choose which actions to explore to learn quickly. However, this assumption leaves RL algorithms incompatible with many real-world business applications.

To understand why, consider the problem of online advertising: each customer is successively presented with one of several advertisements. The advertiser's goal is to maximize the probability that a user will click on an ad. This probability is called the Click Through Rate (CTR, Richardson et al. 2007). A marketing strategy, called a policy, chooses which ads to display to each customer. However, testing new policies could lose money for the company. Therefore, management would not allow a new policy to be tested unless there is strong evidence that the policy is not worse than the company's existing policy. In other words, we would like to estimate the CTR of other strategies using only data obtained from the company's existing policy. In general, the problem of determining a policy's value from data generated by another policy is called off-policy evaluation, where the policy that generates the data is called the behavior policy, and the policy we are trying to evaluate is called the target policy. This problem may be the primary reason batch RL algorithms are hardly used in applications, despite the maturity of the field.

A simple approach to off-policy evaluation is given by the MFMC algorithm (Fonteneau et al., 2010), which constructs complete trajectories for the target policy by concatenating partial trajectories generated by the behavior policy. However, this approach may require a large number of samples to construct complete trajectories. One may think that the number of samples is of little importance, since Internet technology companies have access to millions or billions of transactions. Unfortunately, the dimensionality of real-world problems is generally large (e.g., thousands or millions of dimensions) and the events these companies want to predict can have an extremely small probability of occurring. Thus, sample efficient off-policy evaluation is paramount.

An alternative way of looking at the problem is through counterfactual (CF) analysis (Bottou et al., 2013). Given the outcome of an experiment, CF analysis is a framework for reasoning about what would have happened if some aspect of the experiment had been different. In this paper, we focus on the question: what would have been the expected reward received for executing the target policy rather than the behavior policy? One approach that falls naturally into the CF framework is Importance Sampling (IS) (Bottou et al., 2013; Li et al., 2014). IS methods evaluate the target policy by weighting rewards received by the behavior policy. The weights are determined by the probability that the target policy would perform the same action as the one prescribed by the behavior policy. Unfortunately, IS methods suffer from high variance and typically assume that the behavior policy visits every state that the target policy visits with nonzero probability.
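The weighting idea behind IS can be made concrete with a small sketch. It uses the standard trajectory-wise form, in which each return is multiplied by the product of ratios between the target and behavior action probabilities. The function name `is_estimate` and the dictionary-based policy representation are illustrative assumptions, not taken from the cited works.

```python
def is_estimate(trajectories, target, behavior):
    """Trajectory-wise importance sampling estimate of the target policy value.

    trajectories: list of trajectories, each a list of (state, action, reward)
    target, behavior: dict state -> dict action -> probability (hypothetical format)
    """
    total = 0.0
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for s, a, r in traj:
            # Ratio of target to behavior probability of the logged action.
            weight *= target[s][a] / behavior[s][a]
            ret += r
        total += weight * ret
    return total / len(trajectories)
```

A trajectory containing an action that the target policy never takes receives weight zero, while rare but compatible trajectories receive large weights, which is one way to see where the high variance comes from.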

Even if this assumption holds, IS methods are not able to exploit structure in the environment because their estimators do not create a compact model of the environment. Exploiting this structure could drastically improve the quality of off-policy evaluation with small sample sizes (relative to the dimension of the state-space). Indeed, there is broad empirical support that model-based methods are more sample efficient than model-free methods (Hester & Stone, 2009; Jong & Stone, 2007). One broad class of compact models is Factored-state Markov Decision Processes (FMDPs, Kearns & Koller 1999; Strehl et al. 2007; Chakraborty & Stone 2011). An FMDP model can often be learned with a number of samples logarithmic in the total number of states, if the structure is known. Unfortunately, inferring the structure of an FMDP is generally computationally intractable for FMDPs with high-dimensional state-spaces (Chakraborty & Stone, 2011), and in real-world problems the structure is rarely known in advance.

Ideally, we would like to apply model-based methods to off-policy evaluation because they are generally more sample efficient than model-free methods such as MFMC and IS. In addition, we want to use algorithms that are computationally tractable. To this end, we introduce G-SCOPE, which learns the structure of an FMDP greedily. G-SCOPE is both sample efficient and computationally scalable. Although G-SCOPE does not always learn the true structure, we provide theoretical analysis relating the number of samples to the error in evaluating the target policy. Furthermore, our experimental analysis demonstrates that G-SCOPE is significantly more sample efficient than model-free methods.

The main contributions of this paper are: 

• a novel, scalable method for off-policy evaluation 
that exploits unknown structure, 

• a finite sample analysis of this method, and 

• a demonstration through experiments that this approach is sample efficient.


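The paper is organized as follows. In Section 2, we describe the problem setting and notations. Section 3 elaborates on our greedy structure learning algorithm. Our main theorem and its analysis are given in Section 4. Section 5 presents experiments. In Section 6, we discuss limitations of G-SCOPE and future research directions.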


2. Background 

We consider dynamics that can be represented by a Markov Decision Process (MDP; Puterman 2009):

Definition 1. A Markov Decision Process (MDP) is a tuple $(S, A, P(s' \mid s, a), R(s, a), \rho)$ where $S$ is the state space, $A$ is the action space, $P$ represents the transition probabilities from every state-action pair to another state, $R$ represents the reward function associating each state-action pair with a random real number, and $\rho$ is a distribution over the initial state of the process.

We denote by $\pi$ a Markov policy that maps states to a distribution over actions. The process horizon is $T$, and applying a policy for $T$ steps starting from $s_0 \sim \rho$ results in a cumulative reward known as the value function: $V^{\pi}(s_0) = E\left[\sum_{t=0}^{T-1} R(s_t, a_t) \mid s_0, \pi\right]$, where the expectation is taken with respect to $P$, $R$ and $\pi$. We assume $R$ is known and immediate rewards are bounded in $[0, 1]$.

The system dynamics is as follows: First, an initial state $s_0$ is sampled from $\rho$. Then, for each time step $t = 0, \ldots, T - 1$, an action $a_t$ is sampled according to the policy $\pi(s_t)$, a reward $r_t$ is awarded according to $R(s_t, a_t)$ and the next state $s_{t+1}$ is sampled by $P(\cdot \mid s_t, a_t)$. The quantity of interest is the expected policy value $\nu^{\pi} = \rho^{\top} V^{\pi}$.
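The sampling procedure just described can be sketched as a Monte Carlo rollout that averages $T$-step returns. This is an illustration for a small tabular MDP only. The function name `rollout_value`, the list-based tables, and the use of deterministic rewards are assumptions made to keep the example short.

```python
import random

def rollout_value(P, R, rho, policy, T, n_rollouts=1000, seed=0):
    """Monte Carlo estimate of the expected T-step return of `policy`.

    P[s][a]   : list of next-state probabilities
    R[s][a]   : reward in [0, 1] (deterministic here, a simplification)
    rho       : initial-state distribution
    policy[s] : list of action probabilities in state s
    """
    rng = random.Random(seed)
    n_states = len(rho)
    total = 0.0
    for _ in range(n_rollouts):
        # Sample the initial state from rho.
        s = rng.choices(range(n_states), weights=rho)[0]
        for _ in range(T):
            # Sample an action, collect the reward, then sample the next state.
            a = rng.choices(range(len(policy[s])), weights=policy[s])[0]
            total += R[s][a]
            s = rng.choices(range(n_states), weights=P[s][a])[0]
    return total / n_rollouts
```

Such rollouts require access to the true (or an estimated) transition model, which is exactly what is unavailable for the target policy in the off-policy setting.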


2.1. Off-Policy Evaluation 

We consider the finite horizon batch setup. Given are $H$ trajectories of length $T$ sampled from an MDP with an initial state distribution $\rho$ and behavior policy $\pi_b$. The off-policy evaluation problem is to estimate the $T$-step value of a target policy $\pi$ (different from $\pi_b$). For the target policy $\pi$, we aim to minimize the difference between the true and estimated policy value:

$$\left| \nu^{\pi} - \hat{\nu}^{\pi} \right| . \qquad (1)$$


2.2. Factored MDPs 

Suppose the state space can be decomposed into $D$ discrete variables. We denote the $i$-th variable of $X$ by $X(i)$, and for a given subset of indices $\Psi \subseteq [D] \triangleq \{1, 2, \ldots, D\}$, let $X(\Psi)$ be the subset of corresponding variables $\{X(i)\}_{i \in \Psi}$. We define a factored MDP, similar to Guestrin et al. 2003:



Definition 2. A Factored MDP (FMDP) is an MDP $(S, A, P, R, \rho)$ such that the state $X \in S$ is composed of a set of $D$ variables $\{X(i)\}_{i=1}^{D}$, where each variable can take values from a finite domain, such that the probability of the next state $Y$ given that action $a$ is performed in state $X$ satisfies

$$\Pr(Y \mid X, a) = \prod_{i=1}^{D} \Pr(Y(i) \mid X, a) . \qquad (2)$$

For simplicity, we assume that all variables lie in the same domain $\Gamma$, i.e., $X \in \Gamma^{D}$, where $\Gamma$ is a finite set. Furthermore, each variable in the next state $Y(i)$ only depends on a subset of variables $X(\Phi_i)$ where $\Phi_i \subseteq [D]$. The indices in $\Phi_i$ are called the parents of $i$. When the sizes of the parent sets are smaller than $D$, the FMDP can be represented more compactly:

$$\Pr(Y \mid X, a) = \prod_{i=1}^{D} \Pr(Y(i) \mid X(\Phi_i), a) . \qquad (3)$$
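To illustrate the compact factorization, the following sketch samples a next state one variable at a time from per-variable conditional tables, and compares how many conditional distributions a flat model and a factored model with at most $m$ parents per variable would need. The names `sample_next_state` and `table_sizes`, and the dictionary layout of the tables, are hypothetical choices for this example.

```python
import random

def sample_next_state(x, a, parents, cpt, rng):
    """Sample Y given X = x and action a, one variable at a time.

    parents[i] : tuple of parent indices of variable i
    cpt[i]     : dict (parent_values, a) -> dict value -> probability
    """
    y = []
    for i in range(len(x)):
        key = (tuple(x[j] for j in parents[i]), a)
        dist = cpt[i][key]
        values = list(dist)
        y.append(rng.choices(values, weights=[dist[v] for v in values])[0])
    return tuple(y)

def table_sizes(D, gamma, n_actions, m):
    """Count conditional distributions: flat model vs. factored model
    in which every variable has exactly m parents (an upper bound otherwise)."""
    flat = gamma ** D * n_actions          # one distribution per full state-action pair
    factored = D * gamma ** m * n_actions  # one per variable and parent realization-action pair
    return flat, factored
```

For example, with 20 binary variables, a single action and 4 parents per variable, `table_sizes(20, 2, 1, 4)` gives 1,048,576 distributions for the flat model against 320 for the factored one.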

Before delving into the algorithm and the analysis, we provide some notation. For a subset of indices $\Psi \subseteq [D]$, a realization-action pair $(v, a) \in \Gamma^{|\Psi|} \times A$ is a specific instantiation of values for the corresponding variables $X(\Psi), a$. We denote by $\Gamma_i = \Gamma^{|\Phi_i|} \times A$ the set of all realization-action pairs for the parents of node $i$, and
mark A = F t . 

The following quantities are used in the algorithm and the subsequent analysis. Denote by $\Psi \subseteq [D]$ a subset of indices and by $v \in \Gamma^{|\Psi|}$ a realization of the corresponding variables:

$$\Pr(Y(i) \mid X(\Psi) = v, a) = \frac{\sum_{t=1}^{T} \Pr(Y(i), X(\Psi) = v, a, t)}{\sum_{t=1}^{T} \Pr(X(\Psi) = v, a, t)} ,$$

$$\hat{\Pr}(Y(i) = y \mid X(\Psi) = v, a) \triangleq \frac{n(y, v, a)}{n(v, a)} , \qquad (4)$$

where $n(\cdot)$ counts how often the corresponding event is observed in the data, and the probabilities on the right-hand side of the first equation are conditioned on the behavior policy $\pi_b$, omitted for brevity. Note that if $\Psi \supseteq \Phi_i$ then $\Pr(Y(i) \mid X(\Psi) = v, a) = \Pr(Y(i) \mid X(\Phi_i) = v(\Phi_i), a)$, and the policy dependency cancels out.
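The count-based estimate in (4) can be sketched as follows. The function name `empirical_conditional` and the representation of the batch as a list of one-step transitions `(x, a, y)` are assumptions for illustration.

```python
from collections import Counter

def empirical_conditional(transitions, i, psi):
    """Count-based estimate of Pr(Y(i) = y | X(psi) = v, a).

    transitions : iterable of (x, a, y), with x and y tuples of variable values
    i           : index of the next-state variable
    psi         : tuple of indices of the conditioning variables
    Returns dict (v, a) -> dict y_i -> n(y_i, v, a) / n(v, a).
    """
    joint, marginal = Counter(), Counter()
    for x, a, y in transitions:
        v = tuple(x[j] for j in psi)
        joint[(v, a, y[i])] += 1   # n(y, v, a)
        marginal[(v, a)] += 1      # n(v, a)
    est = {}
    for (v, a, yi), n in joint.items():
        est.setdefault((v, a), {})[yi] = n / marginal[(v, a)]
    return est
```

Realization-action pairs that never occur in the data have no entry in the result, which is why a minimum-count threshold is needed before such estimates can be trusted.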

2.3. Previous Work 

Previous works on FMDPs focus on finding the optimal policy. Early works assumed the dependency structure is known (Guestrin et al., 2002; Kearns & Koller, 1999). Degris et al. (2006) proposed a general framework for iteratively learning the dependency structure (this work falls within this framework), yet no theoretical results were presented for their approach. SLF-Rmax (Strehl et al., 2007), Met-Rmax (Diuk et al., 2009) and LSE-Rmax (Chakraborty & Stone, 2011) are algorithms for learning the complete structure. Only the first two require as input the in-degree of the DBN structure. The sample complexity of these algorithms is exponential in the number of parents. Finally, learning the structure of DBNs with no related reward is in itself an active research topic (Friedman et al., 1998; Trabelsi et al., 2013).

There has also been increasing interest in the RL community regarding the topic of off-policy evaluation. Works focusing on model-based approaches mainly provide bounds on the value function estimation error. For example, the simulation lemma (Kearns & Singh, 2002) can be used to provide sample complexity bounds on such errors. On the other hand, model-free approaches suggest estimators while trying to reduce the bias. Precup (2000) presents several methods based on applying importance sampling on eligibility traces, along with an empirical comparison; Thomas et al. (2015) analyzed bounds on the estimation error for this method. A different approach was suggested by Fonteneau et al. (2010): evaluate the policy by generating artificial trajectories, each a concatenation of one-step transitions from observed trajectories. The main problem of these approaches, besides the computational cost, is that a substantial amount of data is required to generate reasonable artificial trajectories.

3. Algorithm 

In general, the cost of inferring the structure of an FMDP is exponential in $D$ (Strehl et al., 2007). Instead, we propose a naive greedy algorithm which, under some assumptions, can be shown to provide small estimation error on the transition function (G-SCOPE, Algorithm 1).

G-SCOPE (Greedy Structure learning of factored MDPs for Off-Policy Evaluation) receives off-line batch data, two confidence parameters $\epsilon, \delta$ and a minimum acceptable score $C_2$. The outputs $\hat{\Phi}_i$ are the estimated parents of each variable $i$. In the inner loop, the set $\Theta$ is defined as the set of all realization-action pairs that have been observed at least $N(\epsilon, \delta)$ times; these are the only pairs considered further. We then greedily add to $\hat{\Phi}_i$ the $j$-th variable that maximizes the $L_1$ difference between the old distribution, depending only on $\hat{\Phi}_i$, and a distribution conditioned on the additional variable as well. Parents are no longer added when that difference is small, or when none of the candidate realization-action pairs has been observed $N(\epsilon, \delta)$ times. The computational complexity of a naive implementation is $O(HTD^2)$, since G-SCOPE sweeps the data for every input and output variable.

The main idea behind G-SCOPE is that having enough samples will result in an adequate estimate of the conditional probabilities. Then, under appropriate regularity assumptions (stated in Section 4), adding a non-parent variable is unlikely. If parents have a higher effect than non-parents on the $L_1$ distance and non-parents have a weak effect, the argmax procedure will most likely return only parents. When all prominent parents have been found, or when there is not enough data for further inference, the algorithm stops. Once the set of assumed parents is available, we can build an estimated model and simulate any policy.

Algorithm 1 G-SCOPE($H$ $T$-length traj., $\epsilon$, $\delta$, $C_2 = 0$)
  for $i = 1$ to $D$ do
    $\hat{\Phi}_i \leftarrow \emptyset$
    repeat
      $\Theta_i \leftarrow \{(v, v(j), a) \in \Gamma^{|\hat{\Phi}_i|+1} \times A : j \in [D] \setminus \hat{\Phi}_i,\ |n(v, v(j), a)| > N(\epsilon, \delta)\}$
        (for the sample threshold $N(\epsilon, \delta)$)
      if $|\Theta_i| = 0$ then
        break
      end if
      for $j = 1$ to $D$ do
        $\mathrm{diff}_j \leftarrow \max_{(v, v(j), a) \in \Theta_i} \| \hat{\Pr}(Y(i) \mid X(\hat{\Phi}_i \cup j) = (v, v(j)), a) - \hat{\Pr}(Y(i) \mid X(\hat{\Phi}_i) = v, a) \|_1$
      end for
      $j^* \leftarrow \arg\max_{j \in [D]} \mathrm{diff}_j$
      if $\mathrm{diff}_{j^*} > C_2 + 2\epsilon$ then
        $\hat{\Phi}_i \leftarrow \hat{\Phi}_i \cup j^*$
      end if
    until $\mathrm{diff}_{j^*} \le C_2 + 2\epsilon$
  end for
  return $\{\hat{\Phi}_i\}_{i=1}^{D}$
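The greedy loop described above can be sketched in a few lines of Python. This is a simplified illustration rather than the authors' implementation: the batch is assumed to be a flat list of one-step transitions `(x, a, y)`, the parameter `n_min` stands in for the sample threshold $N(\epsilon, \delta)$, and the names `cond_dist`, `l1` and `greedy_parents` are hypothetical.

```python
from collections import Counter, defaultdict

def cond_dist(data, i, psi):
    """Empirical Pr(Y(i) | X(psi) = v, a) and the counts n(v, a)."""
    joint, n = defaultdict(Counter), Counter()
    for x, a, y in data:
        key = (tuple(x[j] for j in psi), a)
        joint[key][y[i]] += 1
        n[key] += 1
    dist = {k: {yv: c / n[k] for yv, c in cnt.items()} for k, cnt in joint.items()}
    return dist, n

def l1(p, q):
    """L1 distance between two distributions stored as dicts."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

def greedy_parents(data, D, n_min, eps, c2=0.0):
    """Greedily select estimated parents for each of the D next-state variables."""
    parents = []
    for i in range(D):
        phi = []
        while True:
            base, _ = cond_dist(data, i, phi)
            best_j, best_diff = None, 0.0
            for j in range(D):
                if j in phi:
                    continue
                ext, n = cond_dist(data, i, phi + [j])
                # Keep only realization-action pairs observed more than n_min times.
                diffs = [l1(ext[(v, a)], base[(v[:-1], a)])
                         for (v, a) in ext if n[(v, a)] > n_min]
                if diffs and max(diffs) > best_diff:
                    best_j, best_diff = j, max(diffs)
            # Stop when no candidate has enough data or the best gain is too small.
            if best_j is None or best_diff <= c2 + 2 * eps:
                break
            phi.append(best_j)
        parents.append(sorted(phi))
    return parents
```

Each call to `cond_dist` sweeps the whole batch, so this sketch favors readability over speed; caching counts across candidates would be a natural optimization.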

An important property of the G-SCOPE algorithm is that it does not necessarily find the actual parents. Instead, we settle on finding a subset of variables providing probably approximately correct transition probabilities. As a result, the number of considered parents scales with the data available, a desired quality linking the model and sample complexity. Since we do not necessarily detect all parents, non-parents can have a non-zero influence on the target variable after all prominent parents have been detected. To avoid including these non-parents, the threshold to add a parent is $C_2$ plus some precision parameters. In practice, we use $C_2 = 0$ because including non-parents with an indirect influence on $Y(i)$ may improve the quality of the model. However, in our analysis, we present assumptions under which the true parents can be learned and explain $C_2$.

Finally, G-SCOPE can be modified to encode and construct the conditional probability distributions using decision trees. A different decision tree is constructed for each action and variable in the next state. Tree-based models can produce more compact representations of the model than encoding the full conditional probability tables specified by $\hat{\Phi}_i$. While we analyze G-SCOPE as an algorithm that separates structure learning from estimating the conditional probability tables, for simplicity and clarity, in our experiments we actually use a decision-tree-based algorithm. The modifications to the analysis for the tree-based algorithm would add unnecessary complexity and distract from the key points of the analysis.

4. Analysis 

By using a scalable but greedy approach to structure learning rather than a combinatorially exhaustive one, G-SCOPE can learn only a subclass of models arbitrarily well. In this section, we introduce three assumptions on the FMDP that describe this subclass, and then analyze the policy evaluation error for this subclass.

We divide $\Phi_i$ into non-overlapping "weak" ($\Phi_i^w$) and "strong" ($\Phi_i^s$) parents. These subsets will be defined formally later, but intuitively, parents in $\Phi_i^s$ have a large influence on $Y(i)$ and are easy to detect, while parents in $\Phi_i^w$ have a smaller influence that may be below the empirical noise threshold and hence not be detected. Our assumptions state that (1) "strong" parents are sufficiently better than non-parents to be detected by G-SCOPE before non-parents; (2) conditionally on "strong" parents, non-parents have too little influence on $Y(i)$ to be accepted by G-SCOPE; and (3) conditioning on some "weak" parents does not increase the influence of other "weak" parents. The first two assumptions are used to bound the probability that G-SCOPE adds non-parents to $\hat{\Phi}_i$ or does not add some strong parents, and the last one to bound the error caused by the potential non-detection of weak parents.

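Assumption 1. Strong parent superiority. For every $i \in [D]$, there exists a "strong" subset of parents $\Phi_i^s \subseteq \Phi_i$ such that $\forall \Psi \subseteq \Phi_i$, $\Phi_i^s \setminus \Psi \neq \emptyset$, $j \in [D] \setminus \Phi_i$, $(v, v(j), a) \in \Gamma^{|\Psi \cup \{j\}|} \times A$, there exists $k \in \Phi_i^s \setminus \Psi$ such that $\exists (v', v'(k), a') \in \Gamma^{|\Psi \cup \{k\}|} \times A$: for some $C_1 > 0$,

$$\| \Pr(Y(i) \mid X(\Psi \cup \{k\}) = (v', v'(k)), a') - \Pr(Y(i) \mid X(\Psi) = v', a') \|_1 > \| \Pr(Y(i) \mid X(\Psi \cup \{j\}) = (v, v(j)), a) - \Pr(Y(i) \mid X(\Psi) = v, a) \|_1 + C_1 .$$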

Assumption 1 ensures that, in terms of influence on the conditional distribution of the target, G-SCOPE finds at least one "strong" parent variable $k$ more attractive than any non-parent variable $j$ as long as $\Phi_i^s \setminus \Psi \neq \emptyset$. This prevents extreme cases where, due to large correlation between parent and non-parent factors, large numbers of non-parents could be added before finding the actual parents, thus considerably increasing the sample complexity. $C_1$ quantifies how much more information a true parent will provide than non-parents. The larger $C_1$, the less likely G-SCOPE is to add a non-parent to $\hat{\Phi}_i$.

Figure 1. An FMDP that fails to satisfy Assumption 1. The factorization for a given action (not shown on the figure) is represented as a dynamic Bayesian network. States not relevant for the explanation are omitted. In the conditional transition probability tables, rows correspond to possible values of parent variables and columns to possible values of the variable. Cells at the intersection contain conditional probability values.

Figure 1 illustrates a subset of the state variables and corresponding conditional transition probability distributions of an FMDP that, for the action implicitly considered, does not satisfy Assumption 1. In this setting, for $t > 3$ and considering $\Psi = \emptyset$, we have

II Pr(Z(3)) - Pr(F(3)|X(3) = f)||r = 2 Vi £ {1, 2} 
||Pr(y(3)) -Pr(F(3)|X(l) =i)||i = 1 Vj £ {1,2}. 

G-SCOPE would add $X(3)$, a non-parent, before any true parent of $Y(3)$ in the estimated parent set. Note that in this particular case it does not matter, as $X(3)$ perfectly determines $Y(3)$. However, adding noise in the transition probabilities would make $X(3)$ less accurate than $X(1)$ and $X(2)$ together.

Assumption 2. Non-parent conditional weakness. For every $i \in [D]$, $\Phi_i^s$ as in Assumption 1, $\forall \Psi : \Phi_i^s \subseteq \Psi \subseteq \Phi_i$, $j \in [D] \setminus \Phi_i$, $(v, v(j), a) \in \Gamma^{|\Psi \cup \{j\}|} \times A$: for some $C_2 \ge 0$,

$$\| \Pr(Y(i) \mid X(\Psi \cup \{j\}) = (v, v(j)), a) - \Pr(Y(i) \mid X(\Psi) = v, a) \|_1 \le C_2 .$$

Assumption 2 ensures that, after G-SCOPE has detected all strong parents, non-parents have low influence on the target variable and therefore G-SCOPE has a low probability of adding them to $\hat{\Phi}_i$. If $\Phi_i^s = \Phi_i$, then $C_2 = 0$.

Assumption 3. Conditional diminishing returns. There exists $C_3 > 0$ such that for every $i \in [D]$, $\Phi_i^s$ as in Assumptions 1 and 2, $\forall \Psi : \Phi_i^s \subseteq \Psi \subseteq \Phi_i$, $j, k \in \Phi_i \setminus \Psi$, $(v, v(j), v(k), a) \in \Gamma^{|\Psi|+2} \times A$, if

$$\| \Pr(Y(i) \mid X(\Psi \cup \{j\}) = (v, v(j)), a) - \Pr(Y(i) \mid X(\Psi) = v, a) \|_1 > \| \Pr(Y(i) \mid X(\Psi \cup \{k\}) = (v, v(k)), a) - \Pr(Y(i) \mid X(\Psi) = v, a) \|_1 ,$$

then:

$$\| \Pr(Y(i) \mid X(\Psi \cup \{j\}) = (v, v(j)), a) - \Pr(Y(i) \mid X(\Psi) = v, a) \|_1 > \| \Pr(Y(i) \mid X(\Psi \cup \{j, k\}) = (v, v(j), v(k)), a) - \Pr(Y(i) \mid X(\Psi \cup \{j\}) = (v, v(j)), a) \|_1 + C_3 .$$

Figure 2. An FMDP that does not satisfy Assumption 3. See Figure 1 for an explanation of the representation.

If conditioning on $X(j)$ provides more knowledge on the output distribution than conditioning on another variable $X(k)$, then it will also provide more knowledge than conditioning on $X(k)$ given $X(j)$. In simple words, Assumption 3 means that information inferred from variables is monotonic, so influential parents cannot go undetected. This assumption supports our greedy scheme, but there are trivial cases where it does not hold.

Consider the substructure represented in Figure 2:

$$\underbrace{\| \Pr(Y(3) \mid X(1) = i) - \Pr(Y(3)) \|_1}_{=0} < \underbrace{\| \Pr(Y(3) \mid X(1, 2) = (i, j)) - \Pr(Y(3) \mid X(1) = i) \|_1}_{=1} .$$

Even though $X(1)$ and $X(2)$ are together very informative about variable $Y(3)$, neither of them is informative on its own. In such a situation, useful variables cannot be detected by a greedy scheme. Assumption 3 prevents this problem.
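A small numerical check makes this failure mode concrete. The sketch below assumes, purely for illustration, that $Y(3)$ is the XOR of two independent, uniformly distributed binary parents; the actual tables of Figure 2 may differ. The helper names `dist_y` and `l1` are hypothetical.

```python
from itertools import product

def dist_y(rows):
    """Distribution of y over equally likely rows of (x1, x2, y)."""
    out = {}
    for *_, y in rows:
        out[y] = out.get(y, 0.0) + 1.0 / len(rows)
    return out

def l1(p, q):
    """L1 distance between two distributions stored as dicts."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

# Assumed toy model: Y(3) = X(1) XOR X(2), with X(1), X(2) uniform on {0, 1}.
rows = [(x1, x2, x1 ^ x2) for x1, x2 in product((0, 1), repeat=2)]

marginal = dist_y(rows)
given_x1 = dist_y([r for r in rows if r[0] == 0])
given_both = dist_y([r for r in rows if r[0] == 0 and r[1] == 1])

single_gain = l1(given_x1, marginal)   # effect of conditioning on X(1) alone
joint_gain = l1(given_both, given_x1)  # effect of adding X(2) once X(1) is known
```

Under this assumed model `single_gain` is 0 and `joint_gain` is 1, so a greedy criterion that scores one candidate variable at a time sees no reason to add either parent.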

These assumptions capture the core hardness of the structure learning problem. On the one hand, there may be implicit dependencies between variables induced by the dynamics, making it hard to separate non-parents. On the other hand, the conditional probabilities may belong to a family of XOR-like functions, initially hiding attractive true parents. Finally, while these assumptions are crucial for proper analysis, non-parent variables may have a beneficial effect on the actual evaluation error, as they still contain information on the true parents' values, and subsequently information on the output variable.

Theorem 1. Suppose Assumptions 1, 2 and 3 hold, and 
let ^ > e+ e > 0 , > 0 , andm = maxj e [£>] |$j|. 

Then there exists 

such that if G-SCOPE is given $H$ trajectories, with probability at least $1 - 2AD(m + 2)(D + 1 - m)\Gamma^{m+1}\delta_1$, G-SCOPE returns an evaluation of $\pi$ satisfying:

$$| \nu^{\pi} - \hat{\nu}^{\pi} | \le T^2 (\delta^* + \epsilon^* D) \qquad (9)$$

where

$$\epsilon^* = (4m + 1)\epsilon + m C_1 + m^2 C_3$$

D 

S* = AT m J2 iMi 

i =1 


$$\psi_i = \max_{(v, a) \in \Gamma_i} \frac{\sum_{t=1}^{T} \Pr(X_t(\Phi_i) = v, a_t = a \mid \pi)}{\sum_{t=1}^{T} \Pr(X_t(\Phi_i) = v, a_t = a \mid \pi_b)} \qquad (10)$$


The proof of Theorem 1 is divided into 4 parts, detailed in the supplementary material. First, we derive a simulation lemma for MDPs stating that, for the target policy, two MDPs with similar transition probability distributions have proximate value functions. We then consider the number of samples needed to estimate the transition probabilities of various realization-action pairs. Samples within a trajectory may not be independent, so we derive a bound based on Azuma's inequality for martingales. Subsequently, we consider the number of trajectories needed to derive a model that evaluates the target policy accurately. If the behavior policy visits the parent realizations that the target policy is likely to visit often enough, then the number of trajectories can be small. On the other hand, if the behavior policy never visits parent realizations that the target policy visits, then the number of trajectories may be infinite. This is captured by $\psi_i$. Finally, we bound the error due to greedy parent selection under Assumptions 1, 2 and 3.

The evaluation error bound depends on the horizon $T$, on the number of variables $D$, on the error bound $\epsilon^*$ on most transition probability values of the FMDP constructed by G-SCOPE and on the probability $T\delta^*$ that a trajectory will not visit a state with badly estimated probability values. The dependency of $\epsilon^*$ on $m$ is the first advantage of the factorization. The constants $C_1$, $C_2$ and $C_3$, from Assumptions 1, 2 and 3, respectively, indicate the effect of the model "hardness" on the bound. When $C_1$ is large enough and $C_2 = C_3 = 0$, the true structure can be learned greedily and the error can be driven arbitrarily close to 0. In other cases, G-SCOPE may learn the wrong structure, resulting in some approximation error.

Next, observe the probability that the bounds in Theorem 1 hold. The multiplicative term $A\Gamma^{m}$ is unavoidable since, for each parent realization-action pair, the estimation error on the transition probability must be bounded. The main advantage of this theorem is the lack of a $\Gamma^{D}$ multiplicative term, which means the effective state space decreased exponentially. The factor $m + 2$ is due to the number of iterations of G-SCOPE where a parent is added, and $D - m + 1$ is due to bounds on non-parents that must be valid for all these iterations.

In $\delta^*$, the $\psi_i$ values characterize the mismatch between the behavior policy and the target policy. If the behavior policy visits all of the parent-action realizations that the target policy visits with sufficiently high probability, then the $\psi_i$ parameters will be small. But if the target policy visits parent-action realizations that are never visited by the behavior policy, then the $\psi_i$ values may be infinite. The $\psi_i$ values are similar to importance sampling weights used by some model-free off-policy algorithms. However, unlike model-free approaches that depend on the differences in the state visitation distributions of the behavior policy and the target policy, the $\psi_i$ values depend on the differences in the parent realization visitation distributions between the behavior policy and the target policy. This is more flexible because the $\psi_i$ values can be small even when the behavior policy and the target policy visit different regions of the state-space.
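The mismatch quantity can be illustrated with a short sketch that takes, for one variable, the time-summed visitation probabilities of each parent realization-action pair under both policies and returns the worst-case ratio. The function name `psi` and the dictionary inputs are assumptions for this example; in practice these visitation probabilities are unknown and the quantity appears only in the analysis.

```python
import math

def psi(target_visits, behavior_visits):
    """Worst-case ratio of target to behavior visitation over (v, a) pairs.

    Both arguments map a parent realization-action pair (v, a) to the sum over
    time steps of the probability of visiting it under the respective policy.
    """
    worst = 0.0
    for key, p_target in target_visits.items():
        if p_target == 0.0:
            continue  # pairs the target policy never visits do not matter
        p_behavior = behavior_visits.get(key, 0.0)
        if p_behavior == 0.0:
            return math.inf  # target visits a pair the behavior policy never sees
        worst = max(worst, p_target / p_behavior)
    return worst
```

Because the keys range over parent realizations rather than full states, two policies that visit disjoint sets of states can still yield a small ratio when they agree on the relevant parent values.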

5. Experiments 

We compared G-SCOPE to other off-policy evaluation 
algorithms in the Taxi domain (Dietterich, 1998), ran¬ 
domly generated FMDPs, and the Space Invaders domain 
(Bellemare et al., 2013). Since the domains compared in
our experiments have different reward scales, we normal¬ 
ized the errors to compare ■ In all experiments, the 
behavior policy differs from the target policy. Further¬ 
more, evaluation error always refers to the target policy’s 
evaluation error, and all trajectory data is generated by 
the behavior policy. 

We compare G-SCOPE to the following algorithms: 

• Model-Free Monte-Carlo (MFMC, Fonteneau et al. 2010): a model-free off-policy evaluation algorithm that constructs artificial trajectories for the target policy by concatenating partial trajectories generated by the behavior policy,

• Clipped Importance Sampling (CIS, Bottou et al. 2013): a model-free importance sampling algorithm that uses a heuristic approach to clip extremely large importance sampling ratios,

• Flat : a flat model-based approach that assumes 
no structure between any two state-action pairs and 
simply builds an empirical next state distribution for 
each state-action pair, and 

• Known Structure (KS): a model-based method that is given the true parents, but still needs to estimate the conditional probability tables from data generated by the behavior policy. KS should outperform G-SCOPE, because KS knows the structure. We introduce KS to differentiate the evaluation error due to insufficient samples from the evaluation error due to G-SCOPE selecting the wrong parent variables.

Our experimental results show that (1) model-based off-policy evaluation algorithms are more sample efficient than model-free methods, (2) exploiting structure can dramatically improve sample efficiency, and (3) G-SCOPE often provides a good evaluation of the target policy despite its greedy structure learning approach.

5.1. Taxi Domain 

The objective in the Taxi domain (Dietterich, 1998) is for the agent to pick up a passenger from one location and to drop the passenger off at a destination. The state can be described by four variables. We selected the initial state according to a uniform random distribution and used a horizon $T = 200$. The behavior policy selected actions uniformly at random, while the target policy was derived by solving the Taxi domain with the Rmax algorithm (Brafman & Tennenholtz, 2002). We discovered that the deterministic policy returned by Rmax was problematic for CIS, because the probability of almost all trajectories generated by the behavior policy was 0 with respect to the target policy. To resolve this problem, we modified the policy returned by Rmax to ensure that every action is selected in every state with probability at least $\epsilon = 0.05$.
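One simple way to obtain such a policy, shown here only as a sketch, is to mix the original action distribution with a uniform floor so that each of the $K$ actions keeps probability at least $\epsilon$. The source does not state which construction was used, so the mixing rule and the name `soften` are assumptions.

```python
def soften(policy, eps):
    """Return action probabilities that are each at least eps and sum to 1.

    policy : list of action probabilities for a single state
    eps    : minimum probability per action (requires len(policy) * eps <= 1)
    """
    k = len(policy)
    if k * eps > 1.0:
        raise ValueError("eps too large for the number of actions")
    # Shrink the original distribution and add a uniform floor of eps per action.
    return [(1.0 - k * eps) * p + eps for p in policy]
```

For a deterministic policy over six actions and $\epsilon = 0.05$, the chosen action keeps probability 0.75 and every other action receives 0.05, so no behavior-policy trajectory has zero probability under the softened target policy.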

The Taxi domain is a useful benchmark because we know 
the true structure and the total number of states is only 
500. Thus, we can compare G-SCOPE to KS and Flat. 

Figure 3 presents the normalized evaluation error (on a log-scale) for MFMC, CIS, Flat, KS, and G-SCOPE over 2,000 trajectories generated by the behavior policy. Median and quantiles are estimated over 40 independent trials. For intermediate and large numbers of trajectories, G-SCOPE performs about the same as if the structure were given and achieves smaller error than the model-free algorithms (MFMC and CIS). Notice that MFMC, CIS, and Flat, which do not take advantage of the domain's structure, require a large number of trajectories before they achieve low evaluation error. Interestingly, the Flat (model-based) approach appears to be more sample efficient than MFMC, which is in line with observations that model-based RL is more efficient than model-free RL (Hester & Stone, 2009; Jong & Stone, 2007). KS and G-SCOPE, on the other hand, achieve low evaluation error after just a few trajectories and have similar performance, except for very few trajectories, where G-SCOPE can adapt the model complexity to the number of samples and therefore achieves a lower evaluation error than the algorithm knowing the structure. This provides one example where greedy structure learning is effective.

Figure 3. Taxi domain: Median evaluation error for the target policy (shaded region: 1st to 3rd quantiles) on log-scale for MFMC, CIS, Flat, KS, and G-SCOPE, varying the number of trajectories generated by the behavior policy. Without exploiting structure, MFMC and Flat require many trajectories to achieve small evaluation error. Yet, KS and G-SCOPE achieve small evaluation error with just a few trajectories. Because G-SCOPE adapts the complexity of the model to the samples available, it achieves smaller estimation error than even KS for extremely few trajectories.

5.2. Randomly Generated Factored Domains 

To test G-SCOPE in a higher dimensional problem,
where we still know the true structure, we randomly
generated FMDPs with D = 20 dimensional states. The
domain of each variable was Γ = {1, 2}. For each state
variable, the number of parents was selected uniformly
from 1 to 4, and the parents were also chosen randomly.
Afterwards, the conditional probability tables were filled
in uniformly and normalized to ensure they specified
proper probability distributions. The FMDP was given
a sparse reward function that returned 1 if and only if the
last bit in the state vector was 1 and returned 0 otherwise.
We used a horizon T = 200. The behavior policy selected
actions uniformly at random, while the target policy
was derived by running SARSA (Sutton & Barto, 1998)
with linear value function approximation on the FMDP
for 5,000 episodes with a learning rate of 0.1, discount
factor 0.9, and epsilon-greedy parameter 0.05. After
training SARSA, we extracted a stationary target policy. As
in the Taxi domain, we modified the policy returned by
SARSA to ensure that every action could be selected in
every state with probability at least ε = 0.05.

Figure 4. Random FMDP domain: Average evaluation error
(±1 std. deviation) on log-scale for MFMC, KS, and G-SCOPE
(with H = 20 and 200 trajectories). G-SCOPE has slightly
worse performance than Known Structure, but G-SCOPE
achieves significantly lower evaluation error than MFMC.
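The generation procedure described for the random FMDPs can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the number of actions (2), the encoding of the binary domain as {0, 1} instead of {1, 2}, and the seeding are all assumptions made for the example.

```python
import random

def random_fmdp(num_factors=20, num_actions=2, max_parents=4, seed=0):
    """Sample a random binary FMDP: a parent list and a CPT for every factor.

    num_actions and the {0, 1} value encoding are assumptions of this sketch.
    """
    rng = random.Random(seed)
    parents, cpts = [], []
    for _ in range(num_factors):
        k = rng.randint(1, max_parents)                 # number of parents, uniform in 1..max_parents
        pa = sorted(rng.sample(range(num_factors), k))  # parent indices chosen at random
        table = {}
        for action in range(num_actions):
            for code in range(2 ** k):
                realization = tuple((code >> bit) & 1 for bit in range(k))
                weights = [rng.random(), rng.random()]  # uniform draws
                total = sum(weights)
                table[(realization, action)] = [w / total for w in weights]  # normalized row
        parents.append(pa)
        cpts.append(table)
    return parents, cpts

def step(state, action, parents, cpts, rng):
    """Sample the next state: each factor depends only on its parents and the action."""
    next_state = []
    for pa, table in zip(parents, cpts):
        probs = table[(tuple(state[j] for j in pa), action)]
        next_state.append(0 if rng.random() < probs[0] else 1)
    return tuple(next_state)

def reward(state):
    """Sparse reward: 1 if and only if the last bit of the state vector is 1."""
    return 1 if state[-1] == 1 else 0
```

A behavior trajectory would then be produced by calling `step` with uniformly random actions for T = 200 steps.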

For the randomly generated FMDPs, we could not construct
a flat model because there are 2^20 = 1,048,576
states and the number of parameters in a flat model scales
quadratically with the size of the state space. However,
we could still compare MFMC, CIS, KS, and G-SCOPE.
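To make the scaling argument concrete, the following sketch counts transition-table entries for a flat model and for a factored model. The helper names are hypothetical, and the counts are raw table entries (not free parameters) per action, since the number of actions in this domain is not stated.

```python
def flat_model_entries(num_values, num_factors, num_actions=1):
    """Entries in a flat transition table: one probability per (state, action, next state)."""
    num_states = num_values ** num_factors
    return num_actions * num_states * num_states

def factored_model_entries(num_values, parents_per_factor, num_actions=1):
    """Entries in a factored model: one small table per factor, indexed by its parents."""
    return sum(num_actions * num_values ** k * num_values for k in parents_per_factor)

# 20 binary factors, at most 4 parents each (the worst case in this domain).
flat = flat_model_entries(2, 20)                  # 2**40 entries per action
factored = factored_model_entries(2, [4] * 20)    # 20 * 2**4 * 2 = 640 entries per action
```

Even in the worst case of four parents per factor, the factored table is many orders of magnitude smaller than the flat one.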

Figure 4 presents the normalized evaluation error (on
a log-scale) for MFMC, CIS, KS, and G-SCOPE given
H = 20 and H = 200 trajectories from the behavior policy.
Averages and standard deviations are estimated over
10 independent trials. MFMC fails because in this
high-dimensional task there is not enough data to construct
artificial trajectories for the target policy. CIS fares only
slightly better than MFMC, because it uses all of the
trajectory data. Unfortunately, most of the trajectories
generated by the behavior policy are not probable under the
target policy, so its evaluation of the target policy is
pessimistic. G-SCOPE has slightly worse performance than
KS, but G-SCOPE achieves significantly lower evaluation
error than MFMC and CIS.

5.3. Space Invaders 

In the Space Invaders (SI) domain using the Arcade
Learning Environment (Bellemare et al., 2013), not only
do we not know the parent structure, we also cannot verify
that the factored dynamics assumption even holds (2).
Thus, SI presents a challenging benchmark for off-policy
evaluation. We used the 1024-bit RAM as the state vector.
We set the horizon T = 1000 so that the behavior
policy would experience a diverse set of states.

As in the previous experiment, the behavior policy selected
actions uniformly at random, while the target policy
was derived by running SARSA (Sutton & Barto,
1998) with linear value function approximation on the
FMDP with a learning rate of 0.1, discount factor 0.9, and
epsilon-greedy parameter 0.05. We only trained SARSA
for 500 episodes, because of the time required to sample
an episode. After training, we extracted a stationary target
policy, which ensured all actions could be selected in
all states with probability at least ε = 0.05.

Figure 5. Space Invaders domain: Average evaluation error
(±1 std. deviation) for MFMC, CIS, and G-SCOPE (with H =
40 and 200 trajectories). G-SCOPE achieves significantly
lower evaluation error than MFMC and CIS.

Figure 5 shows the normalized evaluation error for
MFMC, CIS, and G-SCOPE given H = 40 and H = 200
trajectories from the behavior policy. Averages and standard
deviations are estimated over 5 independent trials.
Again, the evaluation error of G-SCOPE is much smaller
than that of MFMC and CIS. In fact, MFMC and CIS perform
no better than a strategy that always predicts that the target
policy's value is 0. The poor performance of MFMC is due to
the impossibility of constructing artificial trajectories from
samples in such a high-dimensional space.

6. Discussion 

We presented a finite sample analysis of G-SCOPE
that shows how samples can be related to the evaluation
error. When m < D, the sample complexity
scales logarithmically with the number of states, where
m = max_{i ∈ [D]} |Φ_i|.

Our experiments show that (1) model-based off-policy
evaluation algorithms are more sample efficient than
model-free methods, (2) exploiting structure can dramatically
improve sample efficiency, and (3) G-SCOPE often
provides a good evaluation of the target policy despite
using a greedy structure learning approach. Thus,
G-SCOPE provides a practical solution for evaluating
new policies. Our empirical evaluation on large and
small FMDPs shows our approach outperforms existing
methods, which only exploit trajectories.

We analyzed G-SCOPE under three assumptions restricting
the class of FMDPs that can be considered. These
three assumptions imply that (1) including a weak parent
will not make any other weak parent (significantly) more
informative than it was before, (2) strong parents are
more relevant than non-parents, and (3) conditioned on
the strong parents, non-parents are non-informative. We
believe that many real-world problems approximately
satisfy these assumptions. If the problem under consideration
does not satisfy them, then learning algorithms whose
computational complexity is combinatorial in the number
of state variables must be considered to correctly identify
the true parents (Chakraborty & Stone, 2011).

To the best of our knowledge, this is the first model-based
algorithm and analysis for off-policy evaluation in
FMDPs. Moreover, G-SCOPE is a tractable algorithm
for learning the structure of an FMDP even if no prior
knowledge is given about the order in which variables
should be considered. That being said, we hope that
showing the effectiveness of structure learning for off-policy
evaluation will encourage the adaptation of existing
algorithms for learning the structure of FMDPs,
and more generally dynamic Bayesian networks, for off-policy
evaluation.
A. List of Notations

Notation            Meaning
$A$                 Action space
$T$                 Time horizon
$t$                 Time index, $t = 0, \ldots, T$
$H$                 Number of trajectories in batch data
$D$                 Number of factors in each state
$[D]$               The set $\{1, 2, \ldots, D\}$
$\Gamma$            Domain of each factor in a state and (dual notation) the number of possible values for the factor
$M$                 Markov Decision Process
$\rho$              Distribution of first state in MDP
$X$                 Input variable (represents previous state)
$Y$                 Output variable (represents next state)
$Y(i)$              The $i$'th variable in the output
$\Psi$              A subset of indices
$X(\Psi)$           The corresponding subset of variables to $\Psi$
$\Phi_i$            Indices of the parents for variable $i$
$m$                 $\max_i |\Phi_i|$
$F_i$               $\Gamma^{|\Phi_i|} \times A$, the set of all realization-action pairs for the parents of node $i$
$\hat{\Phi}_i$      Indices found by G-SCOPE for variable $i$
$n(\text{instance})$  Number of observations in the data fitting the instance
$\Theta_i$          The set of realization-action pairs observed more than $N(\epsilon, \delta)$ times for each $\hat{\Phi}_i$
$\psi_i$            A value signifying policies mismatch (bigger means higher mismatch)


B. Proof of Main Theorem & Supporting Lemmas 

The proof of Theorem 1 is broken down into parts. 

B.1. The Simulation Lemma 

In this subsection, we derive a simulation lemma for MDPs, which essentially says that for a fixed policy two MDPs 
with similar transition probability distributions will result in similar value functions. Our simulation lemma differs 
from other simulation lemmas (e.g., Kearns & Singh 2002; Kakade 2003) in that we only need the guarantee to hold 
for the target policy. To formalize what we mean by “similar” MDPs, we introduce the following assumption. 

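Definition 3. Let $M = (S, A, P, R, \rho)$ be an MDP and $K \subseteq S \times A$. $M$ and $K$ define an induced MDP $M_K = (S, A, P_K, R_K, \rho)$, where

$$P_K(Y \mid X, a) = \begin{cases} P(Y \mid X, a) & \text{if } (X, a) \in K \\ 1 & \text{if } (X, a) \notin K \wedge Y = X \\ 0 & \text{otherwise} \end{cases}$$

and

$$R_K(X, a) = \begin{cases} R(X, a) & \text{if } (X, a) \in K \\ 0 & \text{otherwise} \end{cases}$$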
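The induced MDP of Definition 3 can be sketched in a few lines for a tabular MDP. The dictionary representation and the function name `induce_mdp` are assumptions made for illustration only: pairs inside K keep their dynamics and reward, while pairs outside K become zero-reward self-loops.

```python
def induce_mdp(P, R, K):
    """Build (P_K, R_K) from a tabular MDP.

    P maps (state, action) to a dict {next_state: probability},
    R maps (state, action) to a reward, and K is a set of (state, action) pairs.
    """
    P_K, R_K = {}, {}
    for (x, a), dist in P.items():
        if (x, a) in K:
            P_K[(x, a)] = dict(dist)    # known pair: copy the true dynamics
            R_K[(x, a)] = R[(x, a)]
        else:
            P_K[(x, a)] = {x: 1.0}      # unknown pair: absorbing self-loop
            R_K[(x, a)] = 0.0           # and no reward
    return P_K, R_K

# Tiny two-state, one-action example in which only state 0 is in K.
P = {(0, "a"): {0: 0.3, 1: 0.7}, (1, "a"): {0: 0.5, 1: 0.5}}
R = {(0, "a"): 1.0, (1, "a"): 2.0}
P_K, R_K = induce_mdp(P, R, K={(0, "a")})
```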


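Definition 4. Let $\epsilon > 0$, $M = (S, A, P, R, \rho)$ be an MDP, and $K \subseteq S \times A$. An $\epsilon$-induced MDP $\hat{M} = (S, A, \hat{P}, \hat{R}, \rho)$ with respect to $M$ and $K$ satisfies

$$\forall (X, a) \in K: \quad \|P(\cdot \mid X, a) - \hat{P}(\cdot \mid X, a)\|_1 \le \epsilon,$$
$$\forall (X, a) \notin K,\ \forall Y \in S: \quad \hat{P}(Y \mid X, a) = P_K(Y \mid X, a), \text{ and}$$
$$\forall (X, a) \in S \times A: \quad \hat{R}(X, a) = R_K(X, a).$$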


Assumption 4. $A4(\epsilon, \delta, \pi)$: Let $\epsilon > 0$, $\delta \in (0, 1]$, $\pi$ be a policy, and $M = (S, A, P, R, \rho)$. There exists an $\epsilon$-induced MDP $\hat{M}$ with respect to $M$ and the subset of the state-action space $K \subseteq S \times A$, such that the probability of encountering a state-action pair that is not in $K$ while following $\pi$ in $M$ is small:

$$\Pr\left[\exists_{t \in [T]}\ (X_t, a_t) \notin K \mid M, \pi\right] \le \delta. \tag{11}$$

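Lemma 1. (Simulation Lemma) Suppose Assumption 4 holds with $A4(\epsilon, \delta, \pi)$, then

$$|v - \hat{v}| \le \delta T + \epsilon T^2,$$

where $v = \rho^\top V^\pi_M$ and $\hat{v} = \rho^\top V^\pi_{\hat{M}}$.

Proof.

$$\begin{aligned}
|v - \hat{v}| &= \left|\rho^\top V^\pi_M - \left(\rho^\top V^\pi_{M_K} - \rho^\top V^\pi_{M_K}\right) - \rho^\top V^\pi_{\hat{M}}\right| && \text{Insert } 0 = \rho^\top V^\pi_{M_K} - \rho^\top V^\pi_{M_K} \\
&\le \left|\rho^\top V^\pi_M - \rho^\top V^\pi_{M_K}\right| + \left|\rho^\top V^\pi_{M_K} - \rho^\top V^\pi_{\hat{M}}\right| && \text{By the triangle inequality} \\
&\le \delta T + \left|\rho^\top V^\pi_{M_K} - \rho^\top V^\pi_{\hat{M}}\right| && \text{By (11)}
\end{aligned} \tag{12}$$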


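We represent by $P^\pi_{M_K}, P^\pi_{\hat{M}} \in \mathbb{R}^{S \times S}$ and $R^\pi \in \mathbb{R}^{S}$ the transition matrices and rewards induced by the policy $\pi$. For any matrix $A$, we denote by $\|A\|_p$ the $p$-induced matrix norm. Notice that:

$$\begin{aligned}
\left\|P^\pi_{M_K} - P^\pi_{\hat{M}}\right\|_\infty &= \max_{1 \le i \le |S|} \sum_{j=1}^{|S|} \left|P_{M_K}(s_j \mid s_i, \pi) - P_{\hat{M}}(s_j \mid s_i, \pi)\right| && \text{Norm definition} \\
&= \max_{1 \le i \le |S|} \sum_{j=1}^{|S|} \left|\sum_a \pi(a \mid s_i)\left(P_{M_K}(s_j \mid s_i, a) - P_{\hat{M}}(s_j \mid s_i, a)\right)\right| && \text{Policy decomposition} \\
&\le \max_{1 \le i \le |S|} \sum_a \pi(a \mid s_i) \sum_{j=1}^{|S|} \left|P_{M_K}(s_j \mid s_i, a) - P_{\hat{M}}(s_j \mid s_i, a)\right| && \text{Triangle inequality} \\
&\le \max_{1 \le i \le |S|} \sum_a \pi(a \mid s_i)\, \epsilon = \epsilon && \text{By Definition 4}
\end{aligned}$$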

In addition, we use the following result (page 254 in Bhatia 1997): For any two matrices $X$, $Y$ and induced norm:

$$\|X^m - Y^m\| \le m M^{m-1} \|X - Y\|, \tag{13}$$

where $M = \max(\|X\|, \|Y\|)$. Since $P^\pi_{M_K}$ and $P^\pi_{\hat{M}}$ are stochastic, this inequality holds for the $\infty$-induced norm with $M = 1$. Now:


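$$\begin{aligned}
\left|\rho^\top V^\pi_{M_K} - \rho^\top V^\pi_{\hat{M}}\right| &= \left|\rho^\top \sum_{t=0}^{T} \left(P^\pi_{M_K}\right)^t R^\pi - \rho^\top \sum_{t=0}^{T} \left(P^\pi_{\hat{M}}\right)^t R^\pi\right| && \text{Sum of rewards over steps} \\
&\le \|\rho\|_1 \left\|\sum_{t=0}^{T} \left[\left(P^\pi_{M_K}\right)^t - \left(P^\pi_{\hat{M}}\right)^t\right]\right\|_\infty \|R^\pi\|_\infty && \text{Hölder inequality and submultiplicative norm} \\
&\le \sum_{t=0}^{T} \left\|\left(P^\pi_{M_K}\right)^t - \left(P^\pi_{\hat{M}}\right)^t\right\|_\infty && \text{Triangle inequality and bounded reward} \\
&\le \sum_{t=0}^{T} t \left\|P^\pi_{M_K} - P^\pi_{\hat{M}}\right\|_\infty && \text{Equation 13 for each summand with } m = t \\
&\le \epsilon T^2 && \text{Definition 4 as seen above}
\end{aligned}$$

Therefore, we can combine the results to obtain:

$$|v - \hat{v}| \le \delta T + \epsilon T^2 \tag{14}$$

□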
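As a sanity check of the $\epsilon T^2$ term, the sketch below evaluates a fixed policy on two small Markov chains whose rows differ by a known $L_1$ gap and compares the value difference with the bound. It covers only the special case $K = S \times A$ (so $\delta = 0$), and the matrices, function names, and horizon are illustrative choices rather than anything taken from the experiments.

```python
def finite_horizon_value(rho, P, R, T):
    """Return sum_{t=0}^{T} rho^T P^t R for the Markov chain induced by a fixed policy."""
    dist = list(rho)
    n = len(P)
    total = 0.0
    for _ in range(T + 1):
        total += sum(d * r for d, r in zip(dist, R))   # expected reward at this step
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return total

def max_row_l1_gap(P, Q):
    """Infinity-induced norm of P - Q: the largest L1 distance between matching rows."""
    return max(sum(abs(p - q) for p, q in zip(row_p, row_q))
               for row_p, row_q in zip(P, Q))

P = [[0.9, 0.1], [0.2, 0.8]]          # "true" chain under the target policy
P_hat = [[0.85, 0.15], [0.25, 0.75]]  # estimated chain, each row off by 0.1 in L1
rho, R, T = [1.0, 0.0], [0.0, 1.0], 10

eps = max_row_l1_gap(P, P_hat)
gap = abs(finite_horizon_value(rho, P, R, T) - finite_horizon_value(rho, P_hat, R, T))
bound = eps * T ** 2                  # simulation-lemma bound with delta = 0
```

The bound is loose here, which is expected: it is a worst case over all chains with the same row-wise gap.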


B.2. Bounding the $L_1$-error in Estimates of the Transition Probabilities 

In this subsection, we consider the number of samples needed to estimate the transition probabilities of various
realization-action pairs. The samples we receive are from a trajectory. Each trajectory is independent. Unfortunately,
samples observed at timestep t may depend on samples observed at previous timesteps, so the samples within
a trajectory may not be independent. Therefore, we cannot apply the Weissman inequality (Weissman et al., 2003),
which requires the samples to be independent and identically distributed. Instead, we derive a bound based on a
martingale argument.

Definition 5. A sequence of random variables $X_0, X_1, \ldots$ is a martingale provided that for all $i \ge 0$, we have

$$E[|X_i|] < \infty, \text{ and} \tag{15}$$

$$E[X_{i+1} \mid X_0, X_1, X_2, \ldots, X_i] = X_i. \tag{16}$$

Theorem 2. (Azuma's inequality) Let $\epsilon > 0$ and $X_1, X_2, \ldots$ be a martingale such that $|X_{i+1} - X_i| \le b_i$ for $i \ge 1$, then for all $m \ge 1$

$$\Pr\left[|X_m - X_1| \ge \epsilon\right] \le 2 \exp\left(\frac{-\epsilon^2}{2 \sum_{i=1}^{m} b_i^2}\right). \tag{17}$$

Definition 6. Let $X_1, X_2, \ldots, X_m$ be any set of random variables with support in $\Gamma$ and $f : \Gamma^m \to \mathbb{R}$ be a function.
A Doob martingale is the sequence

$$B_0 = E_{X_1, X_2, \ldots, X_m}\left[f(X_1, X_2, \ldots, X_m)\right], \text{ and}$$

$$B_i = E_{X_{i+1}, X_{i+2}, \ldots, X_m}\left[f(X_1, X_2, \ldots, X_m) \mid X_1, X_2, \ldots, X_i\right], \text{ for } i = 1, 2, \ldots, m.$$


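Lemma 2. Let $\epsilon > 0$, $\Gamma$ be a finite set, $X = (X_1, X_2, \ldots, X_m)$ be a collection of $m \ge 1$ random variables with support in $\Gamma$ generated by an unknown process, and $f_x(X) = \frac{1}{m} \sum_{i=1}^{m} I\{X_i = x\}$ for all $x \in \Gamma$. We denote by $p(x) = E f_x(X)$ for all $x \in \Gamma$. Then

$$\Pr\left[|f_x(X) - p(x)| \ge \epsilon\right] \le 2 \exp\left(\frac{-\epsilon^2 m}{2}\right) \tag{18}$$

for all $x \in \Gamma$ and

$$\Pr\left[\|\hat{p} - p\|_1 \ge \epsilon\right] \le 2|\Gamma| \exp\left(\frac{-\epsilon^2 m}{2}\right), \tag{19}$$

where $\hat{p}(x) = f_x(X)$.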


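Proof. First, notice that $X$ and $m \cdot f_x(\cdot)$ define a Doob martingale such that $|B_{i+1} - B_i| \le 1$ for $i = 1, 2, \ldots, m$. By applying Azuma's inequality, we obtain

$$\Pr\left[|B_m - B_0| \ge m\epsilon\right] \le 2 \exp\left(\frac{-(\epsilon m)^2}{2m}\right),$$

that is,

$$\Pr\left[|f_x(X) - p(x)| \ge \epsilon\right] \le 2 \exp\left(\frac{-\epsilon^2 m}{2}\right),$$

which proves (18).
Now the union bound gives

$$\Pr\left[\bigcup_{x \in \Gamma} \left\{|f_x(X) - p(x)| \ge \epsilon\right\}\right] \le \sum_{x \in \Gamma} \Pr\left[|f_x(X) - p(x)| \ge \epsilon\right] \le 2|\Gamma| \exp\left(\frac{-\epsilon^2 m}{2}\right),$$

which proves (19).

□

Lemma 3. Let $\epsilon, \delta > 0$, and $\Psi \subseteq [D]$. If there are

$$N(\epsilon, \delta) = \frac{2}{\epsilon^2} \log \frac{2|\Gamma|}{\delta}$$

samples of the realization-action pair $(v, a)$ obtained from independent trajectories of $M$, then

$$\left\|\Pr(Y(i) \mid X(\Psi) = v, a) - \widehat{\Pr}(Y(i) \mid X(\Psi) = v, a)\right\|_1 \le \epsilon, \tag{20}$$

with probability at least $1 - \delta$.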

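Proof. Since the samples are taken from the behavior distribution, $\widehat{\Pr}(Y(i) = y \mid X(\Psi) = v, a) = \frac{n(y, v, a)}{n(v, a)} = \frac{1}{n(v, a)} \sum_{k} I\{Y_k(i) = y, X_k(\Psi) = v, a_k = a\}$. By Lemma 2:

$$\Pr\left(\left\|\Pr(Y(i) \mid X(\Psi) = v, a) - \widehat{\Pr}(Y(i) \mid X(\Psi) = v, a)\right\|_1 > \epsilon\right) \le 2|\Gamma| \exp\left(\frac{-\epsilon^2 N}{2}\right) \tag{21}$$

Setting $\delta = 2|\Gamma| \exp\left(\frac{-\epsilon^2 N}{2}\right)$ we obtain $N = \frac{2}{\epsilon^2} \log \frac{2|\Gamma|}{\delta}$. □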
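Solving $\delta = 2|\Gamma| \exp(-\epsilon^2 N / 2)$ for $N$ gives a sample size that is easy to compute. The following sketch does so under the assumption that the logarithm is natural; the function name `samples_needed` is hypothetical, and the result is rounded up to an integer.

```python
import math

def samples_needed(epsilon, delta, domain_size):
    """Smallest integer N with 2 * domain_size * exp(-epsilon**2 * N / 2) <= delta."""
    return math.ceil(2.0 / epsilon ** 2 * math.log(2.0 * domain_size / delta))

# Binary factor, L1 accuracy 0.1, failure probability 0.05.
n = samples_needed(0.1, 0.05, 2)
failure = 2 * 2 * math.exp(-0.1 ** 2 * n / 2)   # plug N back into the tail bound
```

The dependence on the domain size and on $1/\delta$ is only logarithmic, while the dependence on $1/\epsilon$ is quadratic.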

B.3. Bounding the Number of Trajectories 

In this subsection, we derive a bound on the number of trajectories needed to derive a model that evaluates the target 
policy accurately. Notice that the learned model does not need to be accurate everywhere - only the regions of the 
state space where the target policy is likely to visit (and in a FMDP only the parent realizations that the target policy 
is likely to visit). Our analysis takes advantage of this. When the behavior policy visits the parent realizations that the 
target policy is likely to visit, then the number of trajectories can be small. On the other hand, if the behavior policy 
never visits parent realizations that the target policy visits, then the number of trajectories may be infinite. 

We will make use of the following Proposition proved in Li (2009). 

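Proposition 1. (Li, 2009, Lemma 56) Let $k \in \mathbb{N}$, $\mu, \delta \in (0, 1)$, $B_1, B_2, \ldots, B_m$ be a sequence of $m$ independent
Bernoulli random variables such that $E[B_i] \ge \mu$ for $i = 1, 2, \ldots, m$, and

$$m \ge \frac{2}{\mu}\left(k + \ln \frac{1}{\delta}\right), \tag{22}$$

then

$$\Pr\left[\sum_{i=1}^{m} B_i \ge k\right] \ge 1 - \delta. \tag{23}$$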


Proposition 1 tells us the number of experiments we need to perform on a Bernoulli distribution to observe at least k 
successes with high probability. The following corollary modifies the statement of Proposition 1 to tell us the number 
of experiments we need to perform to see a high probability set of outcomes from a categorical distribution at least k 
times with high probability. 

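Corollary 1. (to Proposition 1) Let $\delta \in (0, 1]$, $k \ge 1$, $\Gamma$ be a finite set, $\mu \in \mathcal{M}(\Gamma)$ be a probability distribution with outcomes from $\Gamma$, and $X_1, X_2, \ldots, X_m$ independent random variables sampled from $\mu$. Let
$S^k_m = \left\{x \in \Gamma \mid \sum_{i=1}^{m} I\{X_i = x\} \ge k\right\}$ be the set of elements encountered $k$ or more times and $\bar{S}^k_m = \Gamma \setminus S^k_m$ be its complement. If

$$m \ge \frac{2|\Gamma|}{\delta}\left(k + \ln \frac{|\Gamma|}{\delta}\right), \tag{24}$$

then, with probability at least $1 - \delta$,

$$\Pr_{X \sim \mu}\left[X \in \bar{S}^k_m\right] \le \delta, \tag{25}$$

the set of outcomes visited less than $k$ times has total probability mass less than $\delta$.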


Proof. Consider an infinite sequence of random variables $X_1, X_2, \ldots$ distributed according to $\mu$. Denote by $j[1] < j[2] < \cdots < j[|\Gamma|]$ the indices resulting in the event that $X_j \in S^k_j$ and $X_j \notin S^k_{j-1}$. Notice that we let an index $j[l]$ be
infinite in the case that the event never occurs. However, $\Gamma$ only contains $|\Gamma|$ elements, so an element can be added to
$S^k$ at most $|\Gamma|$ times. Notice that $S^k_{j[l]} = S^k_{j[l]+1} = \cdots = S^k_{j[l+1]-1}$ for $l = 1, 2, \ldots, |\Gamma| - 1$. We construct Bernoulli
random variables

$$B_i = \begin{cases} 1 & \text{if } X_i \notin S^k_{i-1} \\ 0 & \text{otherwise} \end{cases} \tag{26}$$

for $i \ge 1$. So $B_{j[l]+1}, B_{j[l]+2}, \ldots, B_{j[l+1]-1}$ are independent, identically distributed Bernoulli random variables for
$l = 1, 2, \ldots, |\Gamma| - 1$ (but $B_{j[l]}$ and $B_{j[l+1]}$ are not independent). Suppose that $\Pr[B_{j[l]} = 1] > \delta$ for some $l \in \{1, 2, 3, \ldots, |\Gamma|\}$ (this is at least true for $\Pr[B_{j[1]} = 1] = 1 > \delta$), then by Proposition 1 (with $\mu \leftarrow \delta$, $\delta \leftarrow \delta / |\Gamma|$)

$$j[l+1] - (j[l] + 1) \le \frac{2}{\delta}\left(k + \ln \frac{|\Gamma|}{\delta}\right)$$

with probability at least $1 - \frac{\delta}{|\Gamma|}$. Since there are only $|\Gamma|$ elements in $\Gamma$, this can only happen at most $|\Gamma|$ times. Thus, by
the union bound, after $m \ge \frac{2|\Gamma|}{\delta}\left(k + \ln \frac{|\Gamma|}{\delta}\right)$ samples, either all $|\Gamma|$ outcomes have been observed or $\Pr[B_m = 1] \le \delta$,
with probability at least $1 - |\Gamma| \frac{\delta}{|\Gamma|} = 1 - \delta$. If we have observed all $|\Gamma|$ elements then (25) holds trivially. On the other
hand, if $\Pr[B_m = 1] \le \delta$, then

$$\begin{aligned}
\delta &\ge \Pr\left[x \notin S^k_{m-1}\right] && \text{By the definition of } B_m \text{ (26)} \\
&\ge \Pr\left[x \notin S^k_m\right] && \text{The probability of a success decreases because } S^k_{m-1} \subseteq S^k_m \\
&= \Pr_{x \sim \mu}\left[x \in \bar{S}^k_m\right].
\end{aligned}$$

□

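Proposition 2. Let $\delta \in (0, 1]$, $k \ge 1$, $\Gamma$ be a finite set, $\mu, \tilde{\mu} \in \mathcal{M}(\Gamma)$ be probability distributions with outcomes from
$\Gamma$, and $X_1, X_2, \ldots, X_n$ be independent random variables sampled from $\mu$. Let $S^k_n = \left\{x \in \Gamma \mid \sum_{i=1}^{n} I\{X_i = x\} \ge k\right\}$
be the set of elements encountered $k$ or more times and $\bar{S}^k_n = \Gamma \setminus S^k_n$ be its complement. If

$$n \ge \frac{2|\Gamma|}{\delta}\left(k + \ln \frac{|\Gamma|}{\delta}\right),$$

then, with probability at least $1 - \delta$,

$$\Pr_{x \sim \tilde{\mu}}\left[x \in \bar{S}^k_n\right] \le \psi \delta, \tag{27}$$

where $\psi = \max_{x \in \Gamma} \frac{\tilde{\mu}(x)}{\mu(x)}$ (taking $\frac{0}{0} = 0$).

Proof. We want to show $\Pr_{x \sim \tilde{\mu}}[x \in \bar{S}^k_n] \le \psi \delta$. By applying Corollary 1, we have that $\Pr_{x \sim \mu}[x \in \bar{S}^k_n] \le \delta$ with
probability at least $1 - \delta$. It suffices to show that $\Pr_{x \sim \tilde{\mu}}[x \in \bar{S}^k_n] \le \psi \Pr_{x \sim \mu}[x \in \bar{S}^k_n] \le \psi \delta$.

$$\begin{aligned}
\Pr_{x \sim \tilde{\mu}}\left[x \in \bar{S}^k_n\right] &= \sum_{x \in \bar{S}^k_n} \tilde{\mu}(x) \\
&= \sum_{x \in \bar{S}^k_n} \tilde{\mu}(x) \frac{\mu(x)}{\mu(x)} \\
&= \sum_{x \in \bar{S}^k_n} \frac{\tilde{\mu}(x)}{\mu(x)} \mu(x) \\
&\le \left(\max_{y \in \Gamma} \frac{\tilde{\mu}(y)}{\mu(y)}\right) \sum_{x \in \bar{S}^k_n} \mu(x) \\
&= \psi \Pr_{x \sim \mu}\left[x \in \bar{S}^k_n\right].
\end{aligned}$$

□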
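The mismatch factor $\psi$ is simply the largest probability ratio between the target and the sampling distribution. A small sketch, with hypothetical names and dictionaries standing in for the two distributions, shows the computation including the $0/0 = 0$ convention.

```python
def mismatch_ratio(target, behavior):
    """Return the max over outcomes of target(x) / behavior(x), with 0/0 taken as 0."""
    worst = 0.0
    for x, t in target.items():
        b = behavior.get(x, 0.0)
        if t == 0.0:
            ratio = 0.0              # covers 0/0 as well as 0/b
        elif b == 0.0:
            ratio = float("inf")     # target visits an outcome the behavior never samples
        else:
            ratio = t / b
        worst = max(worst, ratio)
    return worst

psi = mismatch_ratio({"x1": 0.5, "x2": 0.5}, {"x1": 0.25, "x2": 0.75})
```

An infinite ratio corresponds to the case where no amount of behavior data can cover what the target policy does.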


For completeness we introduce the following proposition that is used to prove our lemma. 

Proposition 3. (Osband & Van Roy, 2014) Let $Y(i)$ be a set of variables indexed by $i \in [D]$, $v_i$ a realization of $Y(i)$,
$v = (v_1, \ldots, v_D)$ and $\Pr_1$, $\Pr_2$ be two factorized probability distributions over $Y$:

$$\Pr_j(Y) = \prod_{i=1}^{D} \Pr_j(Y(i)), \quad j = 1, 2. \tag{28}$$

Then

$$\left\|\Pr_1(Y = v) - \Pr_2(Y = v)\right\|_1 \le \sum_{i=1}^{D} \left\|\Pr_1(Y(i) = v_i) - \Pr_2(Y(i) = v_i)\right\|_1 \tag{29}$$
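The inequality between the joint $L_1$ distance and the sum of per-factor $L_1$ distances can be checked numerically on a small example. The factor tables below are arbitrary illustrative values, and the helper names are hypothetical.

```python
from itertools import product
from math import prod

def l1(p, q):
    """L1 distance between two probability vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(p, q))

def joint(factors):
    """Joint distribution of independent factors, enumerated over all realizations."""
    supports = [range(len(f)) for f in factors]
    return [prod(f[v] for f, v in zip(factors, values)) for values in product(*supports)]

factors_1 = [[0.9, 0.1], [0.5, 0.5], [0.3, 0.7]]
factors_2 = [[0.8, 0.2], [0.4, 0.6], [0.3, 0.7]]

joint_gap = l1(joint(factors_1), joint(factors_2))                    # left-hand side
factor_gap_sum = sum(l1(f, g) for f, g in zip(factors_1, factors_2))  # right-hand side
```

This is what lets a per-factor error of $\epsilon$ translate into an error of at most $D\epsilon$ on the full next-state distribution.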


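Lemma 4. Let $\epsilon, \delta > 0$. If the number of trajectories

$$H \ge \frac{4AD\Gamma^m}{\delta}\left(\frac{2}{\epsilon^2} \ln\left(\frac{4AD\Gamma^{m+1}}{\delta}\right) + \ln\left(\frac{2AD\Gamma^m}{\delta}\right)\right),$$

then, with probability at least $1 - \delta$, there is a subset of state-action pairs

$$K = \left\{(X, a) \in S \times A \;\middle|\; \left\|\Pr(Y \mid X, a) - \widehat{\Pr}(Y \mid X, a)\right\|_1 \le D\epsilon\right\}$$

such that:

$$\Pr\left[\exists_{t \in [T]}\ (X_t, a_t) \notin K \mid M, \pi\right] \le \frac{T \sum_{i=1}^{D} \psi_i \delta}{2D}, \tag{30}$$

where $\psi_i = \max_{(v, a) \in F_i} \frac{\sum_{t=1}^{T} \Pr(X_t(\Phi_i) = v, a_t = a \mid \pi)}{\sum_{t=1}^{T} \Pr(X_t(\Phi_i) = v, a_t = a \mid \pi_b)}$.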


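Proof. For every $i \in [D]$, we define the random variable $W$:

For a given trajectory, sample a time $t$ uniformly and set $W = (X_t(\Phi_i), a_t)$.

Notice that $W$ is distributed according to the distribution induced by the behavior policy $\pi_b$, and that $W$ receives one
of $A\Gamma^{|\Phi_i|}$ values. We denote the distribution of $W$ by $\mu$ and over the target policy by $\tilde{\mu}$. Setting $k = N(\epsilon, \delta_1)$ in (27) and
using Proposition 2 we obtain that having:

$$\frac{2A\Gamma^{|\Phi_i|}}{\delta_2}\left(\frac{2}{\epsilon^2} \ln\left(\frac{2\Gamma}{\delta_1}\right) + \ln\left(\frac{A\Gamma^{|\Phi_i|}}{\delta_2}\right)\right) \tag{31}$$

samples from $\mu$, with probability at least $1 - \delta_2$,

$$\Pr_{(v, a) \sim \tilde{\mu}}\left[(v, a) : n(v, a) < \frac{2}{\epsilon^2} \ln\left(\frac{2\Gamma}{\delta_1}\right)\right] \le \psi_i \delta_2,$$

where $\psi_i = \max_{(v, a) \in F_i} \frac{\tilde{\mu}(v, a)}{\mu(v, a)} = \max_{(v, a) \in F_i} \frac{\sum_{t=1}^{T} \Pr(X_t(\Phi_i) = v, a_t = a \mid \pi)}{\sum_{t=1}^{T} \Pr(X_t(\Phi_i) = v, a_t = a \mid \pi_b)}$ (taking $\frac{0}{0} = 0$).

By Lemma 3 and given our choice for $k = N(\epsilon, \delta_1)$, if we have observed $N(\epsilon, \delta_1)$ samples from $\Pr(Y(i) \mid X(\Phi_i) = v, a)$, then our estimate $\widehat{\Pr}$ satisfies

$$\left\|\Pr(Y(i) \mid X(\Phi_i) = v, a) - \widehat{\Pr}(Y(i) \mid X(\Phi_i) = v, a)\right\|_1 \le \epsilon,$$

with probability at least $1 - \delta_1$. Now denote by

$$K_i = \left\{(v, a) \in F_i \;\middle|\; \left\|\Pr(Y(i) \mid X(\Phi_i) = v, a) - \widehat{\Pr}(Y(i) \mid X(\Phi_i) = v, a)\right\|_1 \le \epsilon\right\},$$

the set of realization-action pairs for predicting the $i$th output variable where the empirical distribution estimated from
trajectory data is $\epsilon$-close to the true distribution.

By applying the union bound over at most $A\Gamma^{|\Phi_i|}$ realization-action pairs, after $H$ trajectories, we have

$$\Pr\left[\exists_{t \in [T]}\ (X_t(\Phi_i), a_t) \notin K_i \mid M, \pi\right] \le T \psi_i \delta_2,$$

with probability at least $1 - (\delta_2 + \delta_1 A\Gamma^{|\Phi_i|})$. By applying the union bound again over all $D$ output variables, we
obtain

$$\sum_{i=1}^{D} \Pr\left[\exists_{t \in [T]}\ (X_t(\Phi_i), a_t) \notin K_i \mid M, \pi\right] \le T \sum_{i=1}^{D} \psi_i \delta_2$$

with probability at least $1 - D(\delta_2 + A\Gamma^{|\Phi_i|} \delta_1)$. Notice that this implies

$$\Pr\left[\exists_{t \in [T]}\ (X_t, a_t) \notin K \mid M, \pi\right] \le \sum_{i=1}^{D} \Pr\left[\exists_{t \in [T]}\ (X_t(\Phi_i), a_t) \notin K_i \mid M, \pi\right] \le T \sum_{i=1}^{D} \psi_i \delta_2$$

holds with probability at least $1 - D(\delta_2 + A\Gamma^{|\Phi_i|} \delta_1) \ge 1 - D(\delta_2 + A\Gamma^m \delta_1)$.

The bound over $\|\Pr(Y \mid X, a) - \widehat{\Pr}(Y \mid X, a)\|_1$ directly results from Proposition 3.
Setting:

$$\frac{\delta}{2} = D\delta_2 \Rightarrow \delta_2 = \frac{\delta}{2D}, \qquad \frac{\delta}{2} = \delta_1 AD\Gamma^m \Rightarrow \delta_1 = \frac{\delta}{2AD\Gamma^m} \tag{32}$$

We can rewrite $H$ in terms of $\epsilon, \delta$:

$$H \ge \frac{4AD\Gamma^m}{\delta}\left(\frac{2}{\epsilon^2} \ln\left(\frac{4AD\Gamma^{m+1}}{\delta}\right) + \ln\left(\frac{2AD\Gamma^m}{\delta}\right)\right) \tag{33}$$

□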
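The trajectory requirement of Lemma 4, $H \ge \frac{4AD\Gamma^m}{\delta}\left(\frac{2}{\epsilon^2}\ln\frac{4AD\Gamma^{m+1}}{\delta} + \ln\frac{2AD\Gamma^m}{\delta}\right)$, can be evaluated directly to see how it scales. The sketch below uses hypothetical parameter names and treats $A$ and $\Gamma$ as the numbers of actions and of values per factor.

```python
import math

def trajectories_needed(epsilon, delta, num_actions, num_factors, domain_size, max_parents):
    """Evaluate the Lemma 4 lower bound on the number of trajectories H, rounded up."""
    c = num_actions * num_factors * domain_size ** max_parents      # A * D * Gamma^m
    n = 2.0 / epsilon ** 2 * math.log(4.0 * c * domain_size / delta)  # N(epsilon, delta_1)
    return math.ceil(4.0 * c / delta * (n + math.log(2.0 * c / delta)))

small = trajectories_needed(0.5, 0.5, 1, 1, 2, 1)
one_more_parent = trajectories_needed(0.5, 0.5, 1, 1, 2, 2)
```

The bound grows exponentially in the maximal number of parents $m$ but only linearly (up to logarithmic factors) in the number of factors $D$, which is the source of the sample efficiency claimed for sparse structures.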
B.4. Error due to Greedy Parent Selection 

Lemma 5. Suppose Assumptions 1, 2 and 3 hold. Let e > 0.<5| > 0, and 


After applying G-SCOPE, for every i £ [D\, (v,a) £ 0;, and every w £ satisfying w(<!>,;) = u(<J>j), N(w, a) > 
Niefyfy- 

|| Pr(y(i)|X($j) = w,a) — Pr(y(i)| Jf ($,) = v, a)||i < (4D + l)e + D 2 C 3 (34) 

with probability at least 1 — 2 D(m + 1 ){D + 1 — m)AT m+1 Si. 

Proof. This lemma is only concerned with realization-action pairs for which there are enough samples. G-SCOPE
will not consider the score of realization-action pairs that do not have enough samples. When constructing the structure,
this automatically discards realization-action pairs containing non-parents that do not meet the number of samples
required to have an estimation error bounded by $\epsilon$ with high probability. Hence, in what follows, we will always
consider the worst case where there are always enough samples to estimate such probabilities.

To simplify notation, let 

a(k,v,v k ,a) = ||Pr(F(*)|X($ i ) = v,a)~ Pr(Y(i)|X(l> i U {fc}) = (u, v k ), a)||i (35) 

a(k,v,v k ,a) = || Pr(Y(i)|X($i) = v,a) - Pr(Y(i)|X($i U {k}) = (v, v k ), a)||i (36) 

a*(k) = max a(k, v, v k , a) (37) 

V,Vk jfl 

a*(k) = max a(k, v, v k , a) (38) 

v,Vk ,a 

(v*,v k ,a*) = max a(k, v, v k , a) . (39) 

v,Vk ,a 


We want to bound the probability that G-SCOPE adds any non-parent variable. The G-SCOPE algorithm can only 
select a variable k to add to the parent set only if the following necessary condition holds: 

a*(k) > max a*(j ) . 

j£D\&i 


We break up this first part of the proof into two distinct, successive cases. 

1. 3k £ that is not in <b, (G-SCOPE has not added all of the strong parents yet), and 

2. C <1^ (G-SCOPE has added all strong parents). 


Case 1 (G-SCOPE has not added all of the strong parents): 

Let k £ that has not been added yet (k (f_ <t>,) such that k verifies Assumption 1, and j be a non-parent variable. 
We know such a k and corresponding realization-action pair which had been exhibited N(w. a) times exist, since 
we assume there is at least one realization-action pair of the full parents with enough samples (since otherwise the 
requested bound holds trivially). We want to bound the probability that 

a*(k)-a*(j)> 0, (40) 


where 


max -v,v k ,a ||£r(Z(*)|^(^i) = v,a) - Pr(y(*)!*;($» U {k}) = (v, v k ), a)||i- 
rna Ky'y.y ||Pr(y(f)|X(4 i ) = v',a') - Pr(y(i)|X($ J ; U {j}) = a')||i • 


If (40) holds for all non-parents, then G-SCOPE will only add parents from <!»,. For (40) to hold, it is sufficient to have 

a*(k) - a(j,v,Vj,a) > 0 Wj £ [D]\$i,v,Vj,a , (41) 
By applying the triangle inequality, we obtain 


(42) 


and 


(43) 


a{k) = || Pr(F(i)|X(<lj) = v,a)~ Pr(y(i)|X(6, ; ) = tt,a)||i < 

||P r (F(z)|X($i U {k}) = (v,v k ),a) - Pr(y(*)|X($j U { k }) = (v, v k ), a)||i+ 

||Pr(y(*)|=£(*i) =v, a)- KOKOI^^i) = «,o)|| 1 + 
a(fc) , 

a(j) = l|Pi'(r(*)l^(^ t ) =v,a)- Pr(y(i)|X(&i U {j}) = (w, Vj), a)||i < 
l|Pr(ZW®^i u {il) = {v,Vj),a) -Pr(Y{i)\X($iU{j}) = (v, Vj), a)||i+ 

II Pr(P(*)l^(^0 = v,a)~ Pr(y(i)|X($j U{.)}) = (v, vj), a)||i+ 

«(j) • 

By applying equations 42 and 43, Lemma 3 (with our choice of N(e, <5|)) and Assumption 1, 

a*(k) - a(j,v,Vj,a) > 

a*(k) 

-\\PT(m\m i U{k}) = (v*,vl),a*)-Py(Y(i)m^^{k}) = (v*,vl),a*)\\i 
- ||K(y(i)|2C($i) = v*,a*) - Pr(Y(i)jX($i) = rt*,a*)||r 

- l|P r (Z(*)|Z(^iU{. 7 }) = (v,Vj),a) — Pr(F(i)|X(l>j U {j}) = (v, vj), a)||i 

- ||Pr(y(*)|X($j) =v,a) -Pr(y(i)|X($i) = u,a)||i 

- a(j,v,Vj,a) 

> a*(k) — a(j, v, Vj, a) — 4e 

> Ci - 4e > 0 

with probability at least 1 — 4<5i (union bound) for a particular v, Vj,a if Ci > 4e. This holds for all j, v, Vj, a with 
probability at least 1 — (2 + 2 {D — m)Arl'*’ i l+ 1 )(> 1 (union bound again). 

This also means all variables in 4>f will all be detected by G-SCOPE. Indeed, using the triangle inequality, the same 
bounds on 


|Pr(y(*)|X&U{fc}) = («*, v *),a*)-Pr(y(i)|X($iU{fe}) = (v*, t*), a*)IK 
IPteOl^t) =t>*,a*)-Pr(y(t)| ; X:($ i ) =t;* ) o *)|| 1 


(44) 

(45) 


then above, the fact that (Assumption 1) 


a*(k) > max a*(j) + Ci > C\ , 

je[D]\&i 


and the fact that Ci > 4e + C 2 we have 

a*(k) > 


a*(k) 

- II Pr(y(*)|X(^i u {k}) = (v*, v k ), a*) - FT(y(i)|X($i U {k}) = («*, v* k ), a*)||i 

- ||Pr(y(*)|2C(*i) = v*,a*) - Pr(y(*)|X($i) = ^*,a*)||! 

> Ci - 2e > C 2 + 2e , 


with probabilities 1 — 2S±. 
Notice that we made Assumption 1 much stronger than needed, as we demanded the strong parent to stand out for all
its possible realizations. For the proof, we only need to ensure that at least one realization verifying Assumption 1 is
seen enough times to make sure a strong parent is preferred. Alternatively, we could modify Assumption 1 to bound
the probability of not acquiring enough samples for a particular realization that has a sufficiently large score. This
would have negligible impact on the bounds of this lemma and the assumption would be weaker, but its presentation in
the body of the paper would be more complex.

Case 2 (G-SCOPE has added all strong parent variables): 

Now, we bound the probability that G-SCOPE adds a non parent variable j if all strong parents variable <E>f have 
already been added, that is, C C 

2(j) < 

l|Pr(n*)l^(^i u {i}) = (v,Vj),a) - Pr(y(i)|X($j U {j}) = (v, Vj), a)||i + 

II Pr(y(*)|2C($0 =v,a)~ P?(y(*)|2C($i) = v,o)||i + 

a(j) < Cl + 2e, 

< C 2 because of Assumption 2 


with probability at least 1 — (according to Lemma 3) for a particular v, Vj , a and with probability at least 1 — 2 (D — 

m )AT^'\ +1 5! for all v,Vj, cl. 

Combining Case 1 & 2: 

These two points must hold for all stages of the algorithm. 

• They must also hold for each iteration building <!>, . Iterations in the first step correspond to all strong parents, and 

4)" ' 1 e $5", the weak parents added in step 1 (before all strong parents are included). The number of iterations 
in the second point is at most all remaining weak parents <I>”‘ 2 C 1 added in the second step, plus one 

(when the algorithm stops). Note that the probability the first point holds is only 1 — (2 + 2(D — TO)ATl $i l +1 )<5i 
and not 1 — (4 + 2 (D — m)Arl $i l +1 )(5i because we are using the same two bounds involving k twice. 

• They must hold for all D target variable i. 

Let k = 2 (D — m)AT m+1 > 2 (D — m)Arl^ >i l +1 . Using the union bound, these points hold for all stages of the 
algorithm with at least probability 

1 - max ((|<E>f | + |4>^ ,:l |)(2 + k) + (| < E’“’ 2 | + 1)k) 5iD > 1 — (max |$j|(2 + k) + k)D5 1 (46) 

> 1 - (2m + (to + 1)k)DSi (47) 

> l-2D[TO+(TO + l)(D-m)Ar rn+1 ](5 1 (48) 

Transitioning from Probabilities over $, to Probabilities over <!»,: 

We define 4>( : to be the union of T, with the first k variables in ( t>, \ <l>, to be added greedily (according to the true 
probabilities) for the specific (w, a) pair. Also, denote w = ( v , ' 4> ‘ ). 

||Pr(Z(*)|Z(^i) = (v,v),a) - Pr(Y(i)\X($i) = u,«)||i 

< E l|Pr(Z(i)im fe ) = K^) 5 «)-Pr(ZWim fc_1 ) = K^- 1 ).«)lli ( 49 > 

k=1 

+ \\PT(Y{t)\X(^ i )=v,a)-^(Y{t)\X(^ i )=v,a)\\ 1 . 

The inequality is due to the triangle inequality - we observe the quality of adding each additional parent, and are left 
with the estimation error on v. Since the parents were added greedily, by Assumption 3 we can form a bound for the 


sum. Since we have enough samples of v (it’s in (-), ) the second term is small with high probability (by Lemma 3): 

< m\\ Pr(y(()|X($-) = (v,V! ),a) - Pr(X(i)|X($i) = u,a)||i +m 2 C 3 + e , 

< m||Pr(F(i)|X($ I 1 ) = (u,ui),a) - Pr(y(OI^($i) = (u,ui),a)||i 

+ m|| Pr(y(«')|X($i) =v,a) - Pr(Y(i)|X($i) =u,a)||i 
+ m||Pr(F(i)|X($-) = (u,ui),a) - Pr(y(j)|X($i) = u,a)||i + m 2 C 3 + e . 

Where the inequality holds from the triangle inequality. Similar to before, the first two summands can be bounded by e 
with probability 1 — <5i. The third summands is bounded by the algorithm - since v\ was not added to T,, and there were 
enough samples from it ( N(w, a ) > 7V(e, <5i)), it is necessarily smaller than the threshold 2e + C 2 , with probability 
1 — 2<5i for a specific i and 1 — 2D5i for all of them. Therefore, the difference is bounded by: (4m+ l)e+mC 2 +m 2 C 3 
for these probabilities. 

Everything together: 


II Pr(y(*)|X(4>i) = ( v,v),a ) - Pr(Y(i)|X(i>j) = a) || 1 < (4 m + l)e + mC 2 + m 2 C 3 (51) 

with at least probability (union bound) 

1 - 2D [to + (to + 1)(D - m)v4r m+1 ] S 1 - 2D5 1 = 1 - 2 D{m + 1) [l + {D - m)AT m+1 ] (52) 

which is lower bounded by 1 — 2 D(m + 1)(D + 1 — rn)AT rn+1 S\ or 1 — 2 D{D + \) 2 AT m+1 5\ 

□ 


B.5. Proof of Theorem 1 

Theorem 1. Suppose Assumptions 1, 2 and 3 hold. Let ^ > e + e > 0, 5\ >0, and m = max ie p] |$,|, then 
there exists 

= i”(£)) 

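such that if G-SCOPE is given $H$ trajectories, with probability at least $1 - 2AD(m + 2)(D + 1 - m)\Gamma^{m+1}\delta_1$, G-SCOPE
returns an evaluation $\hat{v}$ of $\pi$ satisfying:

$$|v - \hat{v}| \le \delta^* T + \epsilon^* D T^2 \tag{53}$$

where

$$\epsilon^* = (4m + 1)\epsilon + mC_2 + m^2 C_3,$$

$$\delta^* = T \sum_{i=1}^{D} \psi_i A\Gamma^m \delta_1,$$

$$\psi_i = \max_{(v, a) \in F_i} \frac{\sum_{t=1}^{T} \Pr(X_t(\Phi_i) = v, a_t = a \mid \pi)}{\sum_{t=1}^{T} \Pr(X_t(\Phi_i) = v, a_t = a \mid \pi_b)}. \tag{54}$$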

Proof. 1. By Lemma 4, given

$$H \ge \frac{4AD\Gamma^m}{\delta'}\left(\frac{2}{\epsilon^2} \ln\left(\frac{4AD\Gamma^{m+1}}{\delta'}\right) + \ln\left(\frac{2AD\Gamma^m}{\delta'}\right)\right)$$

trajectories, there is a partition of the $(v, a)$ pairs into more likely (set $K$) and less likely pairs with probability at least $1 - \delta'$.
Pairs in set $K$ are seen at least $N(\epsilon, \delta_1)$ times.

2. Since these pairs in $K$ are seen at least $N(\epsilon, \delta_1)$ times, Lemma 5 provides a bound on the estimation error on
the conditional transition probabilities in the FMDP constructed by G-SCOPE that holds with probability at least
$1 - 2D(m + 1)(D + 1 - m)A\Gamma^{m+1}\delta_1$.

3. This FMDP is therefore a $D\epsilon^*$-induced MDP with respect to the original MDP and $K$ (Definition 4, Proposition
3 and Lemma 4).

4. Therefore, $A4(D\epsilon^*, \delta^*, \pi)$ is verified with probability at least (union bound on steps 1 and 2)

$$1 - \left(1 + (m + 1)(D + 1 - m)\Gamma\right)\delta' \ge 1 - (m + 2)(D + 1 - m)\Gamma\delta'$$

for

• $\epsilon^* = (4m + 1)\epsilon + mC_2 + m^2 C_3$,

• $\delta^* = T \sum_{i=1}^{D} \psi_i \delta' / 2D$.

5. These values are then substituted into the Simulation Lemma, and we replace $\delta' = 2AD\Gamma^m \delta_1$ (equation 32) to
obtain the specified result.

□
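To see how the pieces of the final bound combine, the sketch below evaluates $\epsilon^* = (4m + 1)\epsilon + mC_2 + m^2C_3$, $\delta^* = T\sum_i \psi_i A\Gamma^m \delta_1$, and the resulting guarantee $\delta^* T + \epsilon^* D T^2$. All parameter names and the numeric inputs are illustrative assumptions, not values from the experiments.

```python
def evaluation_error_bound(epsilon, delta_1, m, c2, c3, horizon, psi, num_actions, domain_size):
    """Return (epsilon_star, delta_star, bound) for the Theorem 1 guarantee.

    psi is the list of per-factor mismatch values psi_i, so D = len(psi).
    """
    num_factors = len(psi)
    epsilon_star = (4 * m + 1) * epsilon + m * c2 + m ** 2 * c3
    delta_star = horizon * sum(psi) * num_actions * domain_size ** m * delta_1
    bound = delta_star * horizon + epsilon_star * num_factors * horizon ** 2
    return epsilon_star, delta_star, bound

eps_star, delta_star, bound = evaluation_error_bound(
    epsilon=0.01, delta_1=1e-4, m=2, c2=0.0, c3=0.0,
    horizon=10, psi=[1.0, 1.0], num_actions=2, domain_size=2,
)
```

The quadratic dependence on the horizon $T$ dominates quickly, so the per-factor accuracy $\epsilon$ has to shrink roughly like $1/(DT^2)$ for the bound to be informative.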